using StatsBase, DataFrames, DataFrameMacros
using RCall, CategoricalArrays, TexTables
using Distributions, Gadfly, Compose, MLJ
using Effects, StandardizedPredictors
using GLM, MLJGLMInterface, StableRNGs
import AnovaGLM as aov
using MLJ: schema
Statistical Modelling
2 Introduction
2.1 Initial data manipulation
We will look at data from 500 singleton births in a London Hospital. We will estimate the effect of several potential predictors on the birth weights of babies.
R"""
data(births, package = "Epi")
""";
@rget births
births |> dropmissing!
births |> schema
┌─────────┬────────────┬─────────┐
│ names   │ scitypes   │ types   │
├─────────┼────────────┼─────────┤
│ id      │ Continuous │ Float64 │
│ bweight │ Continuous │ Float64 │
│ lowbw   │ Continuous │ Float64 │
│ gestwks │ Continuous │ Float64 │
│ preterm │ Continuous │ Float64 │
│ matage  │ Continuous │ Float64 │
│ hyp     │ Continuous │ Float64 │
│ sex     │ Continuous │ Float64 │
└─────────┴────────────┴─────────┘
In this tutorial, we take a first look at machine learning with MLJ. MLJ requires that all our variables are of scitype Continuous. For convenience, I will transform the categorical variables in R and save the result as a different data frame.
R"""
births2 = births %>%
mutate(
low = factor(lowbw, labels=c('No', 'Yes')),
preterm = factor(preterm, labels=c('No', 'Yes')),
hyp = factor(hyp, labels=c('No hypertension', 'Hypertension')),
sex = factor(sex, labels=c('Male', 'Female'))
) %>%
na.omit() %>%
var_labels(
bweight = "Birth weight (g)",
low = "Birth weight < 2.5 kg?",
gestwks = "Gestation period (weeks)",
preterm = "Gestation period < 37 weeks?",
matage = "Maternal age (years)",
hyp = "Maternal hypertension",
sex = "Sex"
)
""";
@rget births2
DataFrames.rename!(
births2,
:gestwks => :Gestation,
:matage => :Age,
:hyp => :Hypertension,
:sex => :Sex
);
2.2 Descriptive statistics
Let’s start by looking at the table of descriptive statistics.
R"""
births2 %>%
select(-c(id, lowbw)) %>%
tbl_summary() %>%
cosm_sum(bold=TRUE) %>% theme_pubh() %>%
set_caption("Descriptive statistics of singleton births in a London Hospital.") %>%
print_html()
""";
| Variable | N = 490 |
|---|---|
| Birth weight (g) | 3,188 (2,860, 3,554) |
| Gestation period (weeks) | 39.13 (37.94, 40.09) |
| Gestation period < 37 weeks? | 63 (13%) |
| Maternal age (years) | 34.0 (31.0, 37.0) |
| Maternal hypertension | |
| No hypertension | 419 (86%) |
| Hypertension | 71 (14%) |
| Sex | |
| Male | 256 (52%) |
| Female | 234 (48%) |
| Birth weight < 2.5 kg? | 59 (12%) |
Construct QQ-Norm plots to check the distributions of birth weight and gestational period.
Code
let
p1 = qq_plot(births.bweight, ylab = "Birth weight (g)")
p2 = qq_plot(births.gestwks, ylab = "Gestational weeks")
hstack(p1, p2)
end
2.3 Linearity
Let’s graphically check if there is a linear relationship between gestation time and birth weight.
plot(
births,
x = :gestwks,
y = :bweight,
Geom.point(), Geom.smooth(smoothing = 0.9),
Guide.xlabel("Gestational weeks"),
Guide.ylabel("Birth weight (g)")
)
What is your observation about linearity from your scatter plot?
There is a linear relationship, but only for the middle part of the curve. The curve is clearly non-linear overall, more like the S-shape characteristic of human growth.
Let’s construct the plot for the section where linearity is more evident.
plot(
@subset(births, :gestwks > 31, :gestwks < 40),
x = :gestwks,
y = :bweight,
Geom.point(), Geom.smooth(smoothing = 0.9),
Guide.xlabel("Gestational weeks"),
Guide.ylabel("Birth weight (g)")
)
Let’s generate a new dataset with the subset of observations that are within the linear section of the relationship between gestation period and birth weight. For this subset, we will select only variables of interest.
births |> names
8-element Vector{String}:
"id"
"bweight"
"lowbw"
"gestwks"
"preterm"
"matage"
"hyp"
"sex"
babies = @select(
@subset(births, :gestwks .> 31, :gestwks .< 40),
{Not([:id, :preterm, :lowbw])}
)
rename!(
babies,
:gestwks => :Gestation,
:matage => :Age,
:hyp => :Hypertension,
:sex => :Sex
)
babies |> head
| Row | bweight | Gestation | Age | Hypertension | Sex |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 2974.0 | 38.52 | 34.0 | 0.0 | 2.0 |
| 2 | 2620.0 | 38.15 | 35.0 | 0.0 | 2.0 |
| 3 | 3751.0 | 39.8 | 31.0 | 0.0 | 1.0 |
| 4 | 3200.0 | 38.89 | 33.0 | 1.0 | 1.0 |
| 5 | 3405.0 | 39.33 | 39.0 | 0.0 | 1.0 |
babies_cat = @select(
@subset(births2, :Gestation .> 31, :Gestation .< 40),
{Not([:id, :lowbw, :preterm, :low])}
)
babies_cat |> head
| Row | bweight | Gestation | Age | Hypertension | Sex |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Cat… | Cat… |
| 1 | 2974.0 | 38.52 | 34.0 | No hypertension | Female |
| 2 | 2620.0 | 38.15 | 35.0 | No hypertension | Female |
| 3 | 3751.0 | 39.8 | 31.0 | No hypertension | Male |
| 4 | 3200.0 | 38.89 | 33.0 | Hypertension | Male |
| 5 | 3405.0 | 39.33 | 39.0 | No hypertension | Male |
2.4 Interactions
We will check for potential interactions graphically. Let’s start with sex.
plot(
babies_cat,
x = :Gestation,
y = :bweight,
color = :Sex,
Geom.point, Geom.smooth(method = :lm),
Guide.xlabel("Gestational weeks"),
Guide.ylabel("Birth weight (g)")
)
Do you think there is graphical evidence of interaction with sex?
The fitted lines are not strictly parallel, but there is no strong evidence of any interaction.
Determine if there is graphical evidence of hypertension effects on the relationship between gestation time and birth weight.
Code
plot(
babies_cat,
x = :Gestation,
y = :bweight,
color = :Hypertension,
Geom.point, Geom.smooth(method = :lm),
Guide.xlabel("Gestational weeks"),
Guide.ylabel("Birth weight (g)")
)
3 GLM Approach
3.1 Model simplification
We can start model simplification using ANOVA.
model_1 = lm(
@formula(bweight ~ Gestation + Age + Hypertension + Sex),
births2
)
aov.anova(model_1; type=3)
Analysis of Variance
Type 3 test / F test
bweight ~ 1 + Gestation + Age + Hypertension + Sex
Table:
───────────────────────────────────────────────────────────────
DOF Exp.SS Mean Square F value Pr(>|F|)
───────────────────────────────────────────────────────────────
(Intercept) 1 21680318.92 21680318.92 113.3498 <1e-23
Gestation 1 91012707.58 91012707.58 475.8360 <1e-73
Age 1 24612.93 24612.93 0.1287 0.7200
Hypertension 1 1519231.86 1519231.86 7.9429 0.0050
Sex 1 4734298.72 4734298.72 24.7520 <1e-06
(Residuals) 485 92765500.86 191269.07
───────────────────────────────────────────────────────────────
Let’s remove Age from the model and define our contrasts for the categorical variables.
bw_cont = Dict(
:Hypertension => EffectsCoding(),
:Sex => EffectsCoding()
);
model_2 = lm(
@formula(bweight ~ Gestation + Hypertension + Sex),
births2,
contrasts = bw_cont
)
aov.anova(model_2; type=3)
Analysis of Variance
Type 3 test / F test
bweight ~ 1 + Gestation + Hypertension + Sex
Table:
───────────────────────────────────────────────────────────────
DOF Exp.SS Mean Square F value Pr(>|F|)
───────────────────────────────────────────────────────────────
(Intercept) 1 31652620.83 31652620.83 165.7846 <1e-32
Gestation 1 90992572.27 90992572.27 476.5851 <1e-73
Hypertension 1 1500405.49 1500405.49 7.8586 0.0053
Sex 1 4714027.66 4714027.66 24.6903 <1e-06
(Residuals) 486 92790113.78 190926.16
───────────────────────────────────────────────────────────────
glm_coef(model_2 |> coeftable |> DataFrame, ratio=false)
| Row | Name | Coef. | Lower 95% | Upper 95% | Pr(>|t|) |
|---|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 | Float64 |
| 1 | (Intercept) | -4319.76 | -4978.97 | -3660.56 | 0.0 |
| 2 | Gestation | 191.005 | 173.814 | 208.196 | 0.0 |
| 3 | Hypertension: Hypertension | -80.664 | -137.202 | -24.126 | 0.005 |
| 4 | Sex: Female | -98.382 | -137.285 | -59.479 | 0.0 |
4 MLJ Continuous variables
4.1 Response and predictors
Our first model will contain all relevant predictors with no interaction terms. We can use unpack to define response and explanatory variables.
@select!(babies, :bweight, :Gestation, :Hypertension, :Sex)
babies |> head
| Row | bweight | Gestation | Hypertension | Sex |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 2974.0 | 38.52 | 0.0 | 2.0 |
| 2 | 2620.0 | 38.15 | 0.0 | 2.0 |
| 3 | 3751.0 | 39.8 | 0.0 | 1.0 |
| 4 | 3200.0 | 38.89 | 1.0 | 1.0 |
| 5 | 3405.0 | 39.33 | 0.0 | 1.0 |
y1, x1 = unpack(babies, ==(:bweight))
x1 |> head
| Row | Gestation | Hypertension | Sex |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 38.52 | 0.0 | 2.0 |
| 2 | 38.15 | 0.0 | 2.0 |
| 3 | 39.8 | 0.0 | 1.0 |
| 4 | 38.89 | 1.0 | 1.0 |
| 5 | 39.33 | 0.0 | 1.0 |
4.2 Data splitting
We will split the data, using the indices of the outcome (response), in two:
- train: 70% of the data
- test: 30% of the data
using StableRNGs
rng = StableRNG(566)
train, test = partition(eachindex(y1), 0.7, shuffle=true, rng=rng);
4.3 Model
LinearRegressor = @load LinearRegressor pkg=GLM verbosity=0
glm = LinearRegressor()
LinearRegressor(
fit_intercept = true,
dropcollinear = false,
offsetcol = nothing,
report_keys = [:deviance, :dof_residual, :stderror, :vcov, :coef_table])
bw_mach = machine(glm, x1, y1)
fit!(bw_mach, rows=train);
glm_coef(report(bw_mach).coef_table |> DataFrame, ratio=false)
| Row | Name | Coef. | Lower 95% | Upper 95% | Pr(>|t|) |
|---|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 | Float64 |
| 1 | Gestation | 201.745 | 168.36 | 235.131 | 0.0 |
| 2 | Hypertension | -164.132 | -320.005 | -8.26 | 0.039 |
| 3 | Sex | -177.601 | -292.207 | -62.996 | 0.003 |
| 4 | (Intercept) | -4374.53 | -5669.29 | -3079.76 | 0.0 |
5 MLJ Categorical variables
In the previous section, we used a continuous version for all predictors. Let’s see what happens when we work with the categorical, original versions of the predictors.
5.1 One Hot Encoding (OHE)
One-hot encoding is the process of converting a categorical variable into as many numerical indicator columns as there are categories. This is done so that the variable can be fed into ML algorithms, which generally do a better job of prediction with numerical inputs.
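To make this concrete, here is a minimal hand-rolled sketch of what one-hot encoding produces (toy data, not the MLJ encoder used later): each level becomes its own indicator column.

```julia
# Toy categorical vector (hypothetical data, not the births data):
sex = ["Male", "Female", "Female", "Male"]
cats = unique(sex)                                  # ["Male", "Female"]

# One indicator column per category:
onehot = [s == c ? 1.0 : 0.0 for s in sex, c in cats]
# onehot is a 4×2 matrix: column 1 flags Male, column 2 flags Female.
# Every row sums to 1, so the columns are redundant; dropping one
# (as drop_last = true does below) keeps the design matrix full rank,
# and the dropped level becomes the reference category.
```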
As before, first, we unpack data into response (outcome) and explanatory (exposure/predictor) variables.
@select!(babies_cat, :bweight, :Gestation, :Hypertension, :Sex)
y2, x2 = unpack(babies_cat, ==(:bweight))
x2 |> head
| Row | Gestation | Hypertension | Sex |
|---|---|---|---|
| | Float64 | Cat… | Cat… |
| 1 | 38.52 | No hypertension | Female |
| 2 | 38.15 | No hypertension | Female |
| 3 | 39.8 | No hypertension | Male |
| 4 | 38.89 | Hypertension | Male |
| 5 | 39.33 | No hypertension | Male |
babies_cat |> schema
┌──────────────┬───────────────┬──────────────────────────────────┐
│ names │ scitypes │ types │
├──────────────┼───────────────┼──────────────────────────────────┤
│ bweight │ Continuous │ Float64 │
│ Gestation │ Continuous │ Float64 │
│ Hypertension │ Multiclass{2} │ CategoricalValue{String, UInt32} │
│ Sex │ Multiclass{2} │ CategoricalValue{String, UInt32} │
└──────────────┴───────────────┴──────────────────────────────────┘
5.2 Model
In OneHotEncoder, all categorical variables are included by default (the features argument). We want to drop one of the levels to use as a reference; we achieve this with the argument drop_last=true. We are going to include the OneHotEncoder as part of a pipeline.
ContinuousEncoder converts all categorical variables to continuous ones, creating new indicator variables for the levels, and makes sure that all numerical variables are indeed continuous. Instead of centring, we will use a scaler, one that standardises numerical variables.
pipe = ContinuousEncoder(drop_last=true) |> Standardizer() |> glm
ProbabilisticPipeline(
continuous_encoder = ContinuousEncoder(
drop_last = true,
one_hot_ordered_factors = false),
standardizer = Standardizer(
features = Symbol[],
ignore = false,
ordered_factor = false,
count = false),
linear_regressor = LinearRegressor(
fit_intercept = true,
dropcollinear = false,
offsetcol = nothing,
report_keys = [:deviance, :dof_residual, :stderror, :vcov, :coef_table]),
cache = true)
As we dropped the last level, we need to reorder the levels of the categorical predictors so that the unexposed groups become the reference, and the coefficients of the exposed groups are shown in the coefficient table.
levels!(x2.Hypertension, ["Hypertension", "No hypertension"])
levels!(x2.Sex, ["Female", "Male"]);
mach = machine(pipe, x2, y2)
untrained Machine; does not cache data
model: ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …)
args:
1: Source @211 ⏎ Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{Multiclass{2}}}}
2: Source @314 ⏎ AbstractVector{ScientificTypesBase.Continuous}
fit!(mach, rows=train)
glm_coef(
report(mach).linear_regressor.coef_table |> DataFrame,
ratio=false
)
| Row | Name | Coef. | Lower 95% | Upper 95% | Pr(>|t|) |
|---|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 | Float64 |
| 1 | Gestation | 347.848 | 290.285 | 405.412 | 0.0 |
| 2 | Hypertension__Hypertension | -60.574 | -118.1 | -3.048 | 0.039 |
| 3 | Sex__Female | -88.856 | -146.195 | -31.517 | 0.003 |
| 4 | (Intercept) | 3023.65 | 2966.53 | 3080.77 | 0.0 |
We can compare the coefficients against those from the GLM approach:
glm_coef(model_2 |> coeftable |> DataFrame, ratio=false)
| Row | Name | Coef. | Lower 95% | Upper 95% | Pr(>|t|) |
|---|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 | Float64 |
| 1 | (Intercept) | -4319.76 | -4978.97 | -3660.56 | 0.0 |
| 2 | Gestation | 191.005 | 173.814 | 208.196 | 0.0 |
| 3 | Hypertension: Hypertension | -80.664 | -137.202 | -24.126 | 0.005 |
| 4 | Sex: Female | -98.382 | -137.285 | -59.479 | 0.0 |
The difference in the coefficients between the GLM model (model_2) and the MLJ model (mach) is that, in the former, Gestation is centred, while in the latter it is standardised (centred and scaled).
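The effect of that scaling can be checked by hand: standardising a predictor multiplies its slope by the predictor's standard deviation, so standardised coefficients are in "per SD" units. A minimal sketch with made-up numbers (illustrative only, not the births data):

```julia
using Statistics

# Toy data with an exact linear relation of slope 200:
x = [36.0, 37.0, 38.0, 39.0, 40.0]
y = 200.0 .* x .- 4000.0

# OLS slope for a simple (one-predictor) regression:
slope(x, y) = cov(x, y) / var(x)

b_raw = slope(x, y)                       # slope on the raw scale
z = (x .- mean(x)) ./ std(x)              # standardised predictor
b_std = slope(z, y)                       # slope on the standardised scale
# b_std / b_raw equals std(x): the standardised slope is the raw slope
# rescaled to a one-standard-deviation change in the predictor.
```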
6 Diagnostics
6.1 Normality
bw_perf = model_perf(model_2);
resid_plot(bw_perf)
6.2 Homoscedasticity
rvf_plot(bw_perf)
6.3 Influential observations
cooks_plot(bw_perf)
cook_lev_plot(bw_perf)
6.3.1 Collinearity
Collinearity occurs when two predictors are highly correlated with each other; when more than two variables are involved, it is often called multicollinearity. However, the terms collinearity and multicollinearity are often used interchangeably. The variance inflation factor (VIF) is used to test for collinearity. The rules of thumb are:
- Values larger than 10 give evidence of collinearity.
- A mean of the VIF factors considerably greater than 1 suggests collinearity.
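The VIF for predictor j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing that predictor on all the others. Here is a hand-rolled sketch with toy data (illustrative only; the model below uses gvif instead):

```julia
using Statistics

# VIF sketch: for each column j, regress it on the remaining columns
# (with an intercept) and compute 1 / (1 - R²).
function vif(X::Matrix{Float64})
    n, p = size(X)
    map(1:p) do j
        xj = X[:, j]
        Z = hcat(ones(n), X[:, setdiff(1:p, j)])   # intercept + other predictors
        β = Z \ xj                                  # least-squares fit
        r2 = 1 - sum(abs2, xj .- Z * β) / sum(abs2, xj .- mean(xj))
        1 / (1 - r2)
    end
end

# Two nearly uncorrelated columns should give VIFs close to 1:
X = [1.0 10.0; 2.0 9.0; 3.0 11.0; 4.0 8.0; 5.0 12.0]
vif(X)
```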
using MixedModelsExtras: gvif
r3.(gvif(model_2))
3-element Vector{Float64}:
1.05
1.053
1.004
The results from gvif follow the same order as the coefficients of the model; that is:
- Gestational weeks.
- Hypertension.
- Sex.
7 Effect Plots
We can plot all the predictors that have an effect on birth weight.
plot(
babies_cat,
x=:Gestation, y=:bweight,
color=:Sex, xgroup=:Hypertension,
Geom.subplot_grid(Geom.point, Geom.smooth(method=:lm)),
Guide.xlabel("Gestational weeks"),
Guide.ylabel("Birth weight (g)")
)
Babies born from mothers with a history of hypertension have, on average, a lower birth weight, and this is independent of their sex. For the group of babies whose mothers do not have a history of hypertension, girls have, on average, a lower birth weight than boys. In all cases, as the number of gestational weeks increases, the birth weight also increases.
coef_plot(
model_2,
[
"Gestational weeks",
"Maternal \n hypertension",
"Sex \n Female - Male"
]
)
8 Performance
8.1 Full data set
r2(model_2) |> r3
0.537
adjr2(model_2) |> r3
0.534
mape(bw_perf) |> r3
0.0
rmse(bw_perf) |> r3
435.164
8.2 Test data set
ŷ = MLJ.predict(mach, rows=test)
ȳ = predict_mode(mach, rows=test);
cross_entropy(ŷ, y2[test]) |> mean |> r3
7.466
coef_det(ȳ, y2[test]) |> r3
0.403